POST: Using Probabilities in Language Processing

Authors

  • Marie Meteer
  • Richard M. Schwartz
  • Ralph M. Weischedel
Abstract

We report here on our experiments with POST (Part of Speech Tagger) to address problems of ambiguity and of understanding unknown words. Part of speech tagging, per se, is a well understood problem. Our paper reports experiments in three important areas: handling unknown words, limiting the size of the training set, and returning a set of the most likely tags for each word rather than a single tag. We describe the algorithms that we used and the specific results of our experiments on Wall Street Journal articles and on MUC terrorist messages.

1. Introduction¹

Natural language processing, and AI in general, have focused mainly on building rule-based systems with carefully handcrafted rules and domain knowledge. Our own natural language database query systems, JANUS, Parlance™, and Delphi, use these techniques quite successfully. However, as we move from the problem of understanding queries in fixed domains to processing open text for applications such as data extraction, we have found rule-based techniques too brittle, and the amount of work necessary to build them intractable, especially when attempting to use the same system on multiple domains.

We report in this paper on one application of probabilistic models to language processing, the assignment of part of speech to words in open text. The effectiveness of such models is well known [DeRose, 1988; Church, 1988; Kupiec, 1989; Jelinek, 1985] and they are currently in use in parsers [e.g. de Marcken, 1990]. Our work is an incremental improvement on these models in two ways: (1) We have

¹ The work reported here was supported by the Advanced Research Projects Agency and was monitored by the Rome Air Development Center under Contract No. F30602-87-D-OO93. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, whether expressed or implied, of the Defense Advanced Research Projects Agency or the United

Similar resources

Comparison of Spectral Methods for Spoken Language Identification

Automatic spoken language identification is the task of identifying a language from the speech signal. Language identification systems can be divided into two categories: spectral-based methods and phonetic-based methods. In the former, short-time characteristics of the speech spectrum are extracted as a multi-dimensional vector. The statistical model of these features is then obtained for each language. The ...

Efficient OCR Post-Processing Combining Language, Hypothesis and Error Models

In this paper, an OCR post-processing method that combines a language model, OCR hypothesis information and an error model is proposed. The approach can be seen as a flexible and efficient way to perform Stochastic Error-Correcting Language Modeling. We use Weighted Finite-State Transducers (WFSTs) to represent the language model, the complete set of OCR hypotheses interpreted as a sequence of ...

Heuristic Approach for Specially Structured Two Stage Flow Shop Scheduling to Minimize the Rental Cost, Processing Time, Set Up Time Are Associated with Their Probabilities Including Transportation Time and Job Weightage

The present paper is an attempt to develop a new heuristic algorithm, find the optimal sequence to minimize the utilization time of the machines and hence their rental cost for two stage specially structured flow shop scheduling under specified rental policy in which processing times and set up time are associated with their respective probabilities including transportation time. Further jo...

A word language model based contextual language processing on Chinese character recognition [7534-22]

The language model design and implementation issue is researched in this paper. Different from previous research, we want to emphasize the importance of n-gram models based on words in the study of language model. We build up a word based language model using the toolkit of SRILM and implement it for contextual language processing on Chinese documents. A modified Absolute Discount smoothing alg...

A Reflection on Kristeva's Approach to the Structure of Language

Reaching out to history and subject in terms of meaning variation, Kristeva could show that language cannot simply be a Saussurean sign system. Rather, she went on to delineate that language, beyond signs, is associated with a dynamic system of signification where the ''speaking subject'' is constantly involved in processing. Julia Kristeva, a French critic, psychoanalyst, theoretician, a post-...

Publication date: 1991